1 Background

This document contains background and examples to illustrate the possible extensions to BRISC that we have been discussing.

1.1 Spatial transcriptomics

We are interested in applying methods such as BRISC (Saha and Datta 2018) and GpGp (Guinness 2018) to identify “spatially variable genes” in spatial transcriptomics data.

In spatial transcriptomics data, we are measuring transcriptome-wide gene expression (i.e. gene expression of all ~30,000 or so genes or transcripts in the human transcriptome) at a grid of approximately 5,000 spatial coordinates on a tissue slide. In the commercially available 10x Genomics Visium platform, the grid has dimensions approximately 6.5mm x 6.5mm, and the spatial coordinates (“spots”) are located in a regular hexagonal grid.

The following figure illustrates the platform.

10x Genomics Visium schematic (from 10x Genomics website).

10x Genomics Visium schematic (from 10x Genomics website).

1.2 Spatially variable genes

The measurements from the Visium platform consist of sequencing read counts for the ~30,000 genes or transcripts at the ~5,000 spatial coordinates (spots), which are stored in a 30,000 x 5,000 counts matrix (note we usually store features in rows and observations in columns in Bioconductor, i.e. p x n instead of n x p). The matrix is highly sparse, due to non-detection of genes resulting from the sensitivity of the technology.

We then perform several initial preprocessing steps, including quality control and filtering, normalization to correct for biases such as different sequencing depth per spot, and log transformation to stabilize variance and transform to a continuous scale. These steps are fairly standard and are adapted from existing analysis pipelines for single-cell RNA sequencing data (although alternatives also exist, e.g. count-based models instead of log-transformation).

Next, we would like to identify a subset of “biologically informative” genes, or rank all genes by their biological information content and then take the top n% (e.g. top 10%). This is done to reduce noise (there are usually thousands of genes that are not related to any of the biological phenomena of interest in the dataset, so removing these tends to leave a clearer signal), and to improve computational scalability in later steps.

In the context of spatial transcriptomics, “biologically informative” means genes that have a spatially variable expression pattern across the tissue slide. For example, this could be related to spatial distributions of cell types, or spatial gradients of expression across the slide dimensions. These genes are referred to as “spatially variable genes” (SVGs), and we would like to identify them by either:

  • identifying a set of statistically significant SVGs (with some assumptions on e.g. kernel and bandwidth), or
  • ranking all genes (e.g. by the estimated spatial variance component) and taking the top n% (e.g. top 10%)

In practice, this means we run a method such as BRISC once per gene (i.e. a loop over the 30,000 genes, or ~16,000 genes after filtering low-expressed genes) to estimate the significance of the spatial variance parameter (sigma squared) or compare the fitted model against a null model without spatial terms, and then rank genes by the resulting p-values. We can also calculate an effect size, e.g. the proportion of spatial variance (sigma_sq / (sigma_sq + tau_sq)).

1.3 Existing methods

Existing methods for these types of analyses include:

However, both of these scale cubically in the number of spatial coordinates (spots). This means they can be used for earlier generations of the spatial transcriptomics platform (which had only a few hundred spots per slide), but cannot be used with the newer platform (5,000 spots per slide).

More recently, the following paper was published in June 2021:

This method scales linearly with the number of spots and is extremely fast overall. However, in our initial evaluations, it does not work very well on our datasets in biological terms, i.e. the top-ranked genes do not look like true SVGs when we investigate them more closely. Specifically, it looks like the ranking of SVGs is driven by small numbers of outlier spots that are hard to control for – however we are still investigating this more closely.

1.4 BRISC and GpGp

The NNGP and Vecchia’s approximation methods in BRISC and GpGp also allow us to fit models that scale linearly with the number of spots, and in our initial evaluations, give more sensible (i.e. biologically meaningful) final rankings of SVGs than SPARK-X.

Therefore, we would like to build a framework around BRISC for ranking SVGs in spatial transcriptomics datasets. In principle, this could be a simple parallelized wrapper function around BRISC, which runs BRISC once per gene and extracts the parameter estimates and other outputs. I have written an example function here, and a similar one for GpGp here.

Crucially, the linear scaling with the number of spots means that we can expect to continue to be able to apply these methods even with possible future generations of the spatial transcriptomics technology with even larger numbers of spots per tissue slide. Compared to the earlier generation of cubically scaling methods, this is a game-changer, and could potentially lead to wide adoption of these methods. Other than SPARK-X (which does not work very well on our datasets) and much simpler methods such as Moran’s I or Geary’s C (which we can use as a baseline comparison), we are not aware of any other methods to identify SVGs that scale linearly.

2 BRISC extensions

In our initial work with BRISC, we have identified three possible extensions, which we think would be very useful for this project. The first two relate to computational scalability to run on thousands of genes, and the third is for approximate likelihood ratio tests.

Below I load some publicly available datasets and demonstrate some examples for these three ideas.

2.1 Install packages

I will use the following packages from version 3.14 (the current devel version) of Bioconductor.

# install Bioconductor 3.14
install.packages("BiocManager")
BiocManager::install(version = "3.14")

# Bioconductor packages
BiocManager::install("SpatialExperiment")
BiocManager::install("STexampleData")
BiocManager::install("scater")
BiocManager::install("scran")

I will also use GpGp and the following fork of BRISC. The only change in this BRISC fork is to output some additional runtimes.

install.packages("GpGp")

# fork of BRISC
remotes::install_github("lmweber/BRISC")

2.2 Load data and preprocessing

Here I load one of our publicly available example datasets, run several standard preprocessing steps to get it to the point where we can apply methods to identify SVGs, and then downsample to a few hundred genes (instead of 30,000) for the purposes of the examples in this document.

A complete workflow example of the preprocessing steps (and the subsequent downstream analysis steps) is available in one of our online resources here.

This dataset measures gene expression in a small tissue sample of human brain from the dorsolateral prefrontal cortex (DLPFC) region, and was described in our publication Maynard and Collado-Torres et al. 2021.

library(SpatialExperiment)
library(STexampleData)
library(scater)
library(scran)
library(ggplot2)
library(dplyr)
library(tidyr)
# load dataset as SpatialExperiment object from STexampleData package
spe <- Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## loading from cache
# dataset is stored in 'genes x spots' format
dim(spe)
## [1] 33538  4992
# metadata describing features and observations
head(rowData(spe))
## DataFrame with 6 rows and 3 columns
##                         gene_id   gene_name    feature_type
##                     <character> <character>     <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
## ENSG00000237613 ENSG00000237613     FAM138A Gene Expression
## ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
## ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression
## ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression
## ENSG00000239906 ENSG00000239906  AL627309.2 Gene Expression
head(colData(spe))
## DataFrame with 6 rows and 3 columns
##                    cell_count ground_truth     sample_id
##                     <integer>     <factor>   <character>
## AAACAACGAATAGTTC-1         NA       NA     sample_151673
## AAACAAGTATCTCCCA-1          6       Layer3 sample_151673
## AAACAATCTACTAGCA-1         16       Layer1 sample_151673
## AAACACCAATAACTGC-1          5       WM     sample_151673
## AAACAGAGCGACTCCT-1          2       Layer3 sample_151673
## AAACAGCTTTCAGAAG-1          4       Layer5 sample_151673
head(spatialData(spe))
## DataFrame with 6 rows and 6 columns
##                            barcode_id in_tissue array_row array_col
##                           <character> <integer> <integer> <integer>
## AAACAACGAATAGTTC-1 AAACAACGAATAGTTC-1         0         0        16
## AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1         1        50       102
## AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1         1         3        43
## AAACACCAATAACTGC-1 AAACACCAATAACTGC-1         1        59        19
## AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1         1        14        94
## AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1         1        43         9
##                    pxl_col_in_fullres pxl_row_in_fullres
##                             <integer>          <integer>
## AAACAACGAATAGTTC-1               2435               3913
## AAACAAGTATCTCCCA-1               8468               9791
## AAACAATCTACTAGCA-1               2807               5769
## AAACACCAATAACTGC-1               9505               4068
## AAACAGAGCGACTCCT-1               4151               9271
## AAACAGCTTTCAGAAG-1               7583               3393
head(spatialCoords(spe))
##                       x    y
## AAACAACGAATAGTTC-1 3913 2435
## AAACAAGTATCTCCCA-1 9791 8468
## AAACAATCTACTAGCA-1 5769 2807
## AAACACCAATAACTGC-1 4068 9505
## AAACAGAGCGACTCCT-1 9271 4151
## AAACAGCTTTCAGAAG-1 3393 7583
# sparse matrix of sequencing read counts
counts(spe)[85:90, 1:6]
## 6 x 6 sparse Matrix of class "dgTMatrix"
##                 AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1
## ENSG00000187730                  .                  3                  1
## ENSG00000226969                  .                  .                  .
## ENSG00000067606                  .                  1                  .
## ENSG00000271806                  .                  .                  .
## ENSG00000182873                  .                  .                  .
## ENSG00000162585                  .                  .                  .
##                 AAACACCAATAACTGC-1 AAACAGAGCGACTCCT-1 AAACAGCTTTCAGAAG-1
## ENSG00000187730                  1                  1                  .
## ENSG00000226969                  .                  .                  .
## ENSG00000067606                  2                  1                  1
## ENSG00000271806                  .                  .                  .
## ENSG00000182873                  .                  .                  .
## ENSG00000162585                  .                  .                  2
# run preprocessing steps

# subset to keep only spots overlapping with tissue
spe <- spe[, spatialData(spe)$in_tissue == 1]

# filter low-quality spots

# identify mitochondrial genes
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
# calculate per-spot QC metrics
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
# select QC thresholds
qc_lib_size <- colData(spe)$sum < 500
qc_detected <- colData(spe)$detected < 250
qc_mito <- colData(spe)$subsets_mito_percent > 30
qc_cell_count <- colData(spe)$cell_count > 12
# combined set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count
colData(spe)$discard <- discard
table(discard)
## discard
## FALSE  TRUE 
##  3582    57
# remove low-quality spots
spe <- spe[, !colData(spe)$discard]

# normalization and log-transformation

# quick clustering for pool-based size factors
set.seed(123)
qclus <- quickCluster(spe)
# calculate size factors
spe <- computeSumFactors(spe, cluster = qclus)
# calculate logcounts (log-transformed normalized counts)
spe <- logNormCounts(spe)

# remove mitochondrial genes

# remove mitochondrial genes since these tend to be highly 
# expressed and not biologically informative
spe <- spe[!is_mito, ]
message("removing ", sum(is_mito), " mitochondrial genes out of total ", nrow(spe), " genes")
## removing 13 mitochondrial genes out of total 33525 genes
dim(spe)
## [1] 33525  3582

Next, we filter genes with extremely low expression, which I have arbitrarily defined as less than or equal to 10 sequencing read counts (“unique molecular identifier” or UMI counts) across all spots combined. We also have a preprocessing function that performs some of these steps here.

# filter low-expressed genes
n_umi <- 10
sums <- rowSums(counts(spe))
ix_remove <- sums <= n_umi
message("removing ", sum(ix_remove), " out of ", nrow(spe), " genes due to low counts")
## removing 17161 out of 33525 genes due to low counts
spe <- spe[!ix_remove, ]
dim(spe)
## [1] 16364  3582

2.3 Example plots and downsampling

Here, we can show some illustrative plots for some genes with known spatial expression patterns in this dataset. These plots show log-transformed normalized expression of each gene in the x-y coordinates of the tissue slide.

# known interesting genes in this dataset
interesting <- c("MOBP", "SNAP25", "PCP4", "HBB", "IGKC", "NPY")

# generate plots
for (i in seq_along(interesting)) {
  gene <- interesting[i]
  ix_gene <- which(rowData(spe)$gene_name == gene)
  
  df <- as.data.frame(cbind(
    spatialCoords(spe), 
    logexpr = logcounts(spe)[ix_gene, ]
  ))
  
  p <- ggplot(df, aes(x = x, y = y, color = logexpr)) + 
    geom_point(size = 0.2) + 
    coord_fixed() + 
    scale_y_reverse() + 
    scale_color_gradient(low = "gray90", high = "blue") + 
    ggtitle(gene) + 
    theme_bw() + 
    theme(panel.grid = element_blank(), 
          axis.title = element_blank(), 
          axis.text = element_blank(), 
          axis.ticks = element_blank())
  
  print(p)
}

Finally, downsample the object to keep only a few hundred random genes, plus the known interesting genes from above. We do this only for computational scalability for the examples in this document, and would not normally do this in an analysis, since we are interested in ranking all genes that pass the filtering thresholds.

# downsample to keep n random genes
n <- 200
set.seed(123)
ix_keep <- sample(seq_len(nrow(spe)), n)

# also keep known interesting genes
ix_interesting <- which(rowData(spe)$gene_name %in% interesting)
ix_keep <- c(ix_interesting, ix_keep)

spe <- spe[ix_keep, ]
dim(spe)
## [1]  206 3582

3 Extension 1: Re-use sorting of coordinates

The first possible extension we have identified is to re-use the sorting of coordinates, which is performed in the initial steps in BRISC_estimation().

In our setup, we fit ~16,000 models in a (parallelized) loop, with a different response vector for each of the ~16,000 genes, and the same grid of spatial coordinates (spots) each time.

From my understanding of the sorting schemes (as described by Guinness 2018), this means we could re-use the sorted coordinates in the loop iterations – i.e. calculate the sorted coordinates once, and pass this as an input to a modified version of BRISC_estimation() in the loop instead of calculating it in each loop iteration. This could potentially save a huge amount of runtime in our loop.

Here I output some runtimes from BRISC_estimation() to estimate the amount of runtime we could save.

For this gene, sorting coordinates takes 1.8 seconds out of 2.8 seconds total.

library(BRISC)
# response vector for a single gene
# i.e. 1 gene, 3582 spots overlapping with tissue
y_i <- logcounts(spe)[10, ]
length(y_i)
## [1] 3582
# scale coordinates proportionally
coords <- spatialCoords(spe)
range_all <- max(apply(coords, 2, function(col) diff(range(col))))
coords <- apply(coords, 2, function(col) (col - min(col)) / range_all)

# covariates
x <- NULL

# run BRISC to show runtimes
system.time({
  BRISC_estimation(coords = coords, y = y_i, x = x, 
                   n.neighbors = 15, order = "AMMD", 
                   cov.model = "exponential", search.type = "cb", 
                   verbose = TRUE)
})
## Warning in BRISC_estimation(coords = coords, y = y_i, x = x, n.neighbors = 15, : The ordering of inputs x (covariates) and y (response) in BRISC_estimation has been changed BRISC 1.0.0 onwards.
## Please check the new documentation with ?BRISC_estimation.
## ---------------------------------------- 
##  Ordering Coordinates 
## runtime for ordering:
##    user  system elapsed 
##   1.663   0.140   1.812 
## ----------------------------------------
##  Model description
## ----------------------------------------
## BRISC model fit with 3582 observations.
## 
## Number of covariates 1 (including intercept if specified).
## 
## Using the exponential spatial correlation model.
## 
## Using 15 nearest neighbors.
## 
## 
## 
## Source not compiled with OpenMP support.
## ----------------------------------------
##  Building neighbor index
## ----------------------------------------
##  Performing optimization
## ----------------------------------------
##  Processing optimizers
## ----------------------------------------
##    user  system elapsed 
##   2.501   0.255   2.771

4 Extension 2: Re-use nearest neighbors

The second possible extension we have identified is to re-use the array of nearest neighbors.

The nearest neighbors are currently calculated internally in BRISC_estimation(), so in our loop setup these are re-calculated in each iteration. If it is possible to calculate the nearest neighbors once at the start and then pass this as an input in the loop iterations, this could potentially also save a large amount of runtime.

Here is an example using GpGp, which is slower overall but can be used to demonstrate these two options.

library(GpGp)
# response vector for a single gene
# i.e. 1 gene, 3582 spots overlapping with tissue
y_i <- logcounts(spe)[10, ]
length(y_i)
## [1] 3582
# sort coordinates
coords <- spatialCoords(spe)
# calculate ordering only once
ord <- order_maxmin(coords)
# calculate nearest neighbors only once
nn <- find_ordered_nn(coords[ord, ], m = 15)

# covariates
x <- NULL

system.time({
  # note: using manual reordering, pre-calculated nearest neighbors
  fit_model(y = y_i[ord], locs = coords[ord, ], X = x[ord, ], 
            covfun_name = "exponential_isotropic", 
            NNarray = nn, reorder = FALSE, m_seq = 15, 
            silent = TRUE)
})
##    user  system elapsed 
##   1.127   0.010   1.142

5 Extension 3: Likelihood ratio tests

The third extension relates to approximate inference using likelihood ratio (LR) tests.

The main output we use to rank genes as SVGs is the estimated significance / p-value for the spatial variance parameter (sigma.sq in BRISC or sigmasq in GpGp). (We also use the estimated proportion of spatial variance as an effect size, i.e. sigma.sq / (sigma.sq + tau.sq) from BRISC or sigmasq / (sigmasq + tausq) from GpGp.)

Using the bootstrap inference from BRISC for 16,000 genes is quite slow in this case, so we have also investigated the use of a LR test for approximate inference. This is similar to the approach used by one of the earlier methods for SVGs (SpatialDE, Svensson et al. 2018).

However, calculating the LR tests requires us to obtain the fitted log-likelihoods for the models from each gene, so that we can calculate the deviance against a null model without spatial terms. From looking at the BRISC code, I think it is possible to extract this, but due to my inexperience with C++ / Rcpp I haven’t been able to do this correctly yet.

As an example, using GpGp we can extract the fitted log-likelihoods as loglik <- out$loglik, and then calculate the LR tests using lm to fit the non-spatial model. My code for this is in this function: https://github.com/lmweber/spatzli/blob/main/R/runSVGsGpGp.R

Here I do this for the 206 genes in our small object from earlier and using the function from above. (This takes ~3 minutes to run with 4 cores on my laptop.)

The outputs are stored in the rowData() slot (which contains features metadata such as gene IDs) in the SpatialExperiment object.

Then, we can rank genes by the p-values, which shows that the known genes with highly spatial expression patterns are highly significant, as expected.

For comparison, we also show what the spatial expression patterns of some nonsignificant SVGs look like.

Note that if we can extract log-likelihoods from BRISC, then we would ideally want constant terms to be included, so that the values can be easily compared against the null (non-spatial) model from lm in the LR tests.

remotes::install_github("lmweber/spatzli")
library(spatzli)
# fit GpGp models
runtime_gpgp <- system.time({
  spe_gpgp <- runSVGsGpGp(spe, x = NULL, n_threads = 4)
})

# runtime
runtime_gpgp
##    user  system elapsed 
## 581.190   9.383 165.375
# calculate ranks according to p-values
rowData(spe_gpgp)$rank_gpgp_pval <- rank(rowData(spe_gpgp)$pval, ties.method = "first")
# calculate ranks according to effect sizes
rowData(spe_gpgp)$rank_gpgp_prop_sv <- rank(-1 * rowData(spe_gpgp)$prop_sv, ties.method = "first")
# show output format
head(rowData(spe_gpgp))
## DataFrame with 6 rows and 19 columns
##                         gene_id   gene_name    feature_type   sigmasq
##                     <character> <character>     <character> <numeric>
## ENSG00000211592 ENSG00000211592        IGKC Gene Expression  0.597180
## ENSG00000168314 ENSG00000168314        MOBP Gene Expression  0.853877
## ENSG00000122585 ENSG00000122585         NPY Gene Expression  0.251057
## ENSG00000244734 ENSG00000244734         HBB Gene Expression  0.337481
## ENSG00000132639 ENSG00000132639      SNAP25 Gene Expression  0.521013
## ENSG00000183036 ENSG00000183036        PCP4 Gene Expression  0.218254
##                 tausq_sigmasq    loglik   runtime     tausq      mean       var
##                     <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000211592      0.747644  -4461.68     0.810  0.446478  0.622402  1.005388
## ENSG00000168314      0.444265  -3569.78     1.042  0.379347  0.817135  1.258807
## ENSG00000122585      1.292681  -3961.78     0.885  0.324537  0.396202  0.577218
## ENSG00000244734      1.078877  -3995.77     0.970  0.364100  0.415438  0.713421
## ENSG00000132639      0.808306  -3763.57     0.823  0.421138  3.473243  0.823830
## ENSG00000183036      2.056803  -3899.88     0.924  0.448906  0.691926  0.677645
##                     spcov  ratio_sv   prop_sv loglik_lm   lr_stat      pval
##                 <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000211592  1.241600  1.337535  0.572199  -5091.76  1260.169         0
## ENSG00000168314  1.130846  2.250910  0.692394  -5494.36  3849.157         0
## ENSG00000122585  1.264648  0.773586  0.436171  -4097.92   272.270         0
## ENSG00000244734  1.398356  0.926889  0.481029  -4477.35   963.152         0
## ENSG00000132639  0.207821  1.237155  0.553004  -4735.06  1942.965         0
## ENSG00000183036  0.675183  0.486191  0.327139  -4385.20   970.653         0
##                      padj rank_gpgp_pval rank_gpgp_prop_sv
##                 <numeric>      <integer>         <integer>
## ENSG00000211592         0              1                 4
## ENSG00000168314         0              2                 2
## ENSG00000122585         0              3                 7
## ENSG00000244734         0              4                 6
## ENSG00000132639         0              5                 5
## ENSG00000183036         0              6                10
# how many significant SVGs are detected? (using adjusted p-values)
table(rowData(spe_gpgp)$padj <= 0.05)
## 
## FALSE  TRUE 
##   151    55
# how many "highly" significant SVGs?
table(rowData(spe_gpgp)$pval == 0)
## 
## FALSE  TRUE 
##   183    23
# gene names for highly significant SVGs (note: includes the 6 known genes)
sig_genes <- rowData(spe_gpgp)$gene_name[rowData(spe_gpgp)$pval == 0]
sig_genes
##  [1] "IGKC"       "MOBP"       "NPY"        "HBB"        "SNAP25"    
##  [6] "PCP4"       "SLC5A11"    "NSF"        "INF2"       "MPPED1"    
## [11] "NUPR1"      "P2RX7"      "CAMK2N1"    "MAP7D2"     "SYN1"      
## [16] "CAMK4"      "STX1A"      "SPARCL1"    "LDB2"       "ADARB2-AS1"
## [21] "LINGO1"     "RPL12"      "SLAIN1"
# plot expression for all highly significant SVGs to check if they look sensible
for (i in seq_along(sig_genes)) {
  gene <- sig_genes[i]
  ix_gene <- which(rowData(spe)$gene_name == gene)
  
  df <- as.data.frame(cbind(
    spatialCoords(spe), 
    logexpr = logcounts(spe)[ix_gene, ]
  ))
  
  p <- ggplot(df, aes(x = x, y = y, color = logexpr)) + 
    geom_point(size = 0.1) + 
    coord_fixed() + 
    scale_y_reverse() + 
    scale_color_gradient(low = "gray90", high = "blue") + 
    ggtitle(gene) + 
    theme_bw() + 
    theme(panel.grid = element_blank(), 
          axis.title = element_blank(), 
          axis.text = element_blank(), 
          axis.ticks = element_blank())
  
  print(p)
}

# gene names for some nonsignificant SVGs (p-values == 1)
nonsig_genes <- rowData(spe_gpgp)$gene_name[rowData(spe_gpgp)$pval == 1]
nonsig_genes
##  [1] "TRIM29"     "BCAS3"      "PTPN12"     "RGL3"       "KCNG2"     
##  [6] "DCLRE1C"    "AKAP3"      "ALKBH8"     "AC016590.3" "SHISA5"    
## [11] "AC026471.4" "TIGD2"      "KANSL1-AS1" "TFAM"       "EBI3"      
## [16] "RBL1"       "MAIP1"      "AEN"        "ZNF575"     "MRC1"      
## [21] "ZC2HC1C"    "MIR200CHG"  "TBL2"       "ZNF766"     "DKC1"      
## [26] "LINC00882"  "PLCE1"      "MAMSTR"     "CNOT2"      "AC010618.3"
## [31] "RRN3"       "ZNF254"     "LENG1"      "AP001267.3" "NPPA-AS1"  
## [36] "USP51"      "ZNF549"     "AL589765.6" "LINC01607"  "ZNF250"    
## [41] "AC007389.5" "NFYA"       "AC092171.3" "RELB"       "ZNF611"
length(nonsig_genes)
## [1] 45
# number of genes to plot
n_plot <- 18

# plot expression for some nonsignificant SVGs (p-values == 1)
for (i in seq_len(n_plot)) {
  gene <- nonsig_genes[i]
  ix_gene <- which(rowData(spe)$gene_name == gene)
  
  df <- as.data.frame(cbind(
    spatialCoords(spe), 
    logexpr = logcounts(spe)[ix_gene, ]
  ))
  
  p <- ggplot(df, aes(x = x, y = y, color = logexpr)) + 
    geom_point(size = 0.1) + 
    coord_fixed() + 
    scale_y_reverse() + 
    scale_color_gradient(low = "gray90", high = "blue") + 
    ggtitle(gene) + 
    theme_bw() + 
    theme(panel.grid = element_blank(), 
          axis.title = element_blank(), 
          axis.text = element_blank(), 
          axis.ticks = element_blank())
  
  print(p)
}

6 Possible future ideas

These are more speculative ideas for now, which we could consider in more detail for future work.

6.1 Faster / multivariate bootstrap inference?

Currently, we can use both bootstrap inference from BRISC and likelihood ratio (LR) tests from GpGp, where we are using the LR tests from GpGp as a fast approximation. The runtime for the full loop over 16,000 genes using GpGp for the LR tests is approximately 5 hours with 4 cores on my laptop. If we can use BRISC for the LR tests instead, we expect this will be faster.

However, would it somehow also be possible to perform the bootstrap inference in a faster setup instead of using our loop? e.g. some sort of multivariate bootstrapping using the full multivariate response for all 16,000 genes (16,000 genes measured at 3,500 spatial coordinates). I’m not sure if this makes sense (since we want to estimate the spatial variance parameter separately for each gene), but would be extremely useful if it is somehow possible.

6.2 Fix bandwidth (or other) parameters?

Would it be biologically meaningful to fix the bandwidth parameter in the kernel? i.e. the phi parameter in BRISC, or the range parameter in GpGp. This is not currently possible in BRISC, but can be done in GpGp. However, when I tried it out, it did not give a biologically sensible final ranking of genes.

From a biological perspective, I am not sure we really want to do this, since we are interested in ranking genes according to the strength of any spatial patterns of expression, which may be correlated with different tissue structures for different genes. However, it is possible that this could make sense in different datasets that we have not looked at yet.

6.3 Multiple samples

Is it possible to fit models that use data from multiple samples (i.e. multiple tissue slides) without first trying to align the x-y coordinates across slides? Aligning x-y coordinates across multiple samples is difficult or impossible both technically and biologically (does it make sense to align if the samples are from slightly different brain tissue areas?), so any way to fit models that gets around this could be quite powerful. For example, fitting a model for a given gene where a single spatial variance parameter is estimated using data from multiple samples.

7 Session info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] spatzli_0.99.0              GpGp_0.4.0                 
##  [3] BRISC_1.0.1                 pbapply_1.4-3              
##  [5] rdist_0.0.5                 RANN_2.6.1                 
##  [7] tidyr_1.1.3                 dplyr_1.0.7                
##  [9] scran_1.21.3                scater_1.21.3              
## [11] ggplot2_3.3.5               scuttle_1.3.1              
## [13] STexampleData_1.1.13        ExperimentHub_2.1.4        
## [15] AnnotationHub_3.1.5         BiocFileCache_2.1.1        
## [17] dbplyr_2.1.1                SpatialExperiment_1.3.4    
## [19] SingleCellExperiment_1.15.2 SummarizedExperiment_1.23.4
## [21] Biobase_2.53.0              GenomicRanges_1.45.0       
## [23] GenomeInfoDb_1.29.8         IRanges_2.27.2             
## [25] S4Vectors_0.31.3            BiocGenerics_0.39.2        
## [27] MatrixGenerics_1.5.4        matrixStats_0.60.1         
## 
## loaded via a namespace (and not attached):
##   [1] spam_2.7-0                    igraph_1.2.6                 
##   [3] BiocParallel_1.27.4           digest_0.6.27                
##   [5] htmltools_0.5.2               viridis_0.6.1                
##   [7] magick_2.7.3                  fansi_0.5.0                  
##   [9] magrittr_2.0.1                memoise_2.0.0                
##  [11] ScaledMatrix_1.1.0            cluster_2.1.2                
##  [13] limma_3.49.4                  Biostrings_2.61.2            
##  [15] R.utils_2.10.1                colorspace_2.0-2             
##  [17] blob_1.2.2                    rappdirs_0.3.3               
##  [19] ggrepel_0.9.1                 xfun_0.25                    
##  [21] crayon_1.4.1                  RCurl_1.98-1.4               
##  [23] jsonlite_1.7.2                glue_1.4.2                   
##  [25] gtable_0.3.0                  zlibbioc_1.39.0              
##  [27] XVector_0.33.0                DelayedArray_0.19.2          
##  [29] kernlab_0.9-29                BiocSingular_1.9.1           
##  [31] DropletUtils_1.13.2           Rhdf5lib_1.15.2              
##  [33] maps_3.3.0                    HDF5Array_1.21.0             
##  [35] scales_1.1.1                  DBI_1.1.1                    
##  [37] edgeR_3.35.1                  Rcpp_1.0.7                   
##  [39] viridisLite_0.4.0             xtable_1.8-4                 
##  [41] dqrng_0.3.0                   bit_4.0.4                    
##  [43] rsvd_1.0.5                    dotCall64_1.0-1              
##  [45] metapod_1.1.0                 httr_1.4.2                   
##  [47] FNN_1.1.3                     ellipsis_0.3.2               
##  [49] pkgconfig_2.0.3               R.methodsS3_1.8.1            
##  [51] farver_2.1.0                  sass_0.4.0                   
##  [53] locfit_1.5-9.4                utf8_1.2.2                   
##  [55] tidyselect_1.1.1              labeling_0.4.2               
##  [57] rlang_0.4.11                  later_1.3.0                  
##  [59] AnnotationDbi_1.55.1          munsell_0.5.0                
##  [61] BiocVersion_3.14.0            tools_4.1.1                  
##  [63] cachem_1.0.6                  generics_0.1.0               
##  [65] RSQLite_2.2.8                 evaluate_0.14                
##  [67] stringr_1.4.0                 fastmap_1.1.0                
##  [69] yaml_2.2.1                    knitr_1.33                   
##  [71] bit64_4.0.5                   purrr_0.3.4                  
##  [73] KEGGREST_1.33.0               sparseMatrixStats_1.5.3      
##  [75] mime_0.11                     R.oo_1.24.0                  
##  [77] compiler_4.1.1                beeswarm_0.4.0               
##  [79] filelock_1.0.2                curl_4.3.2                   
##  [81] png_0.1-7                     interactiveDisplayBase_1.31.2
##  [83] tibble_3.1.4                  statmod_1.4.36               
##  [85] bslib_0.3.0                   stringi_1.7.4                
##  [87] highr_0.9                     fields_12.5                  
##  [89] lattice_0.20-44               bluster_1.3.2                
##  [91] Matrix_1.3-4                  vctrs_0.3.8                  
##  [93] pillar_1.6.2                  lifecycle_1.0.0              
##  [95] rhdf5filters_1.5.0            BiocManager_1.30.16          
##  [97] jquerylib_0.1.4               BiocNeighbors_1.11.0         
##  [99] bitops_1.0-7                  irlba_2.3.3                  
## [101] httpuv_1.6.2                  R6_2.5.1                     
## [103] promises_1.2.0.1              gridExtra_2.3                
## [105] vipor_0.4.5                   codetools_0.2-18             
## [107] assertthat_0.2.1              rhdf5_2.37.0                 
## [109] rjson_0.2.20                  withr_2.4.2                  
## [111] GenomeInfoDbData_1.2.6        grid_4.1.1                   
## [113] beachmat_2.9.1                rmarkdown_2.10               
## [115] DelayedMatrixStats_1.15.4     shiny_1.6.0                  
## [117] ggbeeswarm_0.6.0